This document is based on Gemma’s code.
This is a template for fitting sdmTMB to GOA bottom trawl data. For each species, it fits sdmTMB to life stages (juveniles or adults) to predict CPUE in number of individuals per km\(^{2}\). This predicted CPUE is then scaled up to box area in Atlantis, to get numbers of individuals per box and proportion of the total, which is what we need to initialise Atlantis. This does not include British Columbia.
This workflow is based on the following assumptions:
select <- dplyr::select
load("data/cpue.atf.A.Rdata")
race_data <- as_tibble(dat_stage)
This is RACE data from AKFIN. There are a lot of fields, so let’s select what we need and drop the rest. Note: for lat and lon, for now I am using the start values for each tow, but it would be easy to get the coordinates for the midpoint of the tow.
fields <- c("YEAR",
"HAULJOIN",
"LAT",
"LON",
"DEPTHR",
"SST",
"TEMPR",
"SPECIES_CODE",
"CN",
"BIOM_KGKM2",
"NUM_KM2",
"STAGE")
race_data <- race_data %>% select(all_of(fields)) %>% set_names(c(
"year",
"hauljoin",
"lat",
"lon",
"depth",
"stemp",
"btemp",
"species_code",
"name",
"biom_kgkm2",
"num_km2",
"stage"))
Take a quick look at the data spatially.
# coast for plotting
load("data/goa_coast.Rdata")
coast_sf <- st_as_sf(coast) # turn coast to sf
ggplot()+
geom_point(data = race_data, aes(lon, lat, colour = log1p(num_km2)), size = 1.5)+
scale_colour_viridis_c()+
geom_polygon(data = coast, aes(x = long, y = lat, group = group), colour = "black", fill = "grey80")+
theme_minimal()+
facet_wrap(~year, ncol = 2)+
labs(title = paste(race_data$name,"CPUE from GOA bottom trawl survey - stage:", race_data$stage, sep = " "))
## Regions defined for each Polygons
Take a quick look at time series of total numbers CPUE from raw data
num_year <- race_data %>% group_by(year) %>% summarise(num = sum(log1p(num_km2)))
ggplot(num_year, aes(year, log(num)))+
geom_point()+
geom_path()+
theme_minimal()+
labs(title = paste(race_data$name,"total GOA CPUE from bottom \n trawl survey - stage:", race_data$stage, sep = " "))
The above is across the entire GOA.
Kirstin’s CPUE data includes empty hauls (i.e. zero catches). However, in the step where aggregate the size bin data into life-stage data (not included in this script), we loose those empty hauls (as BIN will be NA for a haul whith zero catch in Kirstin’s data set, which causes zero cathces to be dropped when I join by HAULJOIN). So, we need to add them back to the RACE data here. To do this, I take haul information from the “Haul Descriptions” data set on AKFIN. I then subtract the RACE hauls with catch in race_data from the “Haul Descriptions” list to see which hauls were empty for this particular species / life stage. Then pad these empty hauls with zero CPUEs, and attach them to race_data.
load("data/hauls.Rdata") # as accessible on AKFIN Answers with no further modifications
data_hauls <- levels(factor(race_data$hauljoin))
zero_hauls <- setdiff(levels(factor(hauls$Haul.Join.ID)), data_hauls) # assuming that if there are no records from a haul, the catch in that haul was 0 for this species
# make a data frame to bind by row
zero_catches <- hauls %>% filter(Haul.Join.ID %in% zero_hauls) %>%
select(Year, Haul.Join.ID, Ending.Latitude..dd., Ending.Longitude..dd., Bottom.Depth..m., Surface.Temperature..C., Gear.Temperature..C.) %>%
mutate(species_code = rep(NA, length(Year)),
name = rep(NA, length(Year)),
biom_kgkm2 = rep(0, length(Year)),
num_km2 = rep(0, length(Year)),
stage = rep(NA, length(Year))) %>%
set_names(names(race_data))
# attach by row to race_data
race_data <- rbind(race_data, zero_catches)
# ditch hauls with empty lat lon
race_data <- race_data %>% filter(!is.na(lat) & !is.na(lon))
# and with NA depths
race_data <- race_data %>% filter(!is.na(depth))
This is the mesh that the sdmTMB algorithm uses to estimate spatial autocorrelation. The speed of model running is highly dependent on number of knots. 100 is quite low, you’d want to make sure it was robust by checking multiple resolutions when you have a model you want to actually use (people use like 350-450 or something).
Note: SPDE = Stochastic Partial Differential Equations approach. Some material can be found here, but basically it is a way of calculating the position of the mesh knots.
race_spde <- make_mesh(race_data, c("lon", "lat"), n_knots = 100) # usually 450
plot(race_spde)
Check out the distribution of the biomass density response variable.
hist(race_data$num_km2, breaks = 30)
hist(log1p(race_data$num_km2), breaks = 30)
Proportion of zeroes in percentage.
length(which(race_data$num_km2 == 0))/nrow(race_data)*100
## [1] 28.02988
Try running a model with smooth term for depth. Using 5 knots for the smooth - but this is arbitrary and I should test a range of values and compare. As a note, I am not scaling depth here. The reason is that depth has a different range in the data and the prediction grid, and scaled values have different meaning.
Question for Gemma: Have you ever used convergence metrics produced by the sdmTMB() function to evaluate model fit? Calling m_depth$model provides some information on convergence and iterations of the optimization process, but I confess I am not familiar with these, especially in this particular package. My understanding is that sdmTMB() uses the nlminb() optimization routine, and going off of some hints here and here I am guessing that a simple check that m_depth$model$convergence = 0 and m_depth$model$message = 3,4,5,6 could be a start. I will do more reading, I was just wondering whether you look at this at all or not.
Check out model residuals.
race_data$resids <- residuals(m_depth) # randomized quantile residuals
hist(race_data$resids)
And QQ plot.
qqnorm(race_data$resids)
abline(a = 0, b = 1)
One clear outlier?
Plot the response curve from the depth smooth term.
plot(m_depth$mgcv_mod, rug = TRUE)
Finally, plot the residuals in space. If residuals are constantly larger/smaller in some of the areas, it may be sign that the model is biased and it over/underpredicts consistently for some areas. Residuals should be randomly distributed in space. We need to read in the Atlantis BGM file to do that, as we need the right projection.
Read in BGM and coast.
atlantis_bgm <- read_bgm("data/GOA_WGS84_V4_final.bgm")
race_sf <- race_data %>% st_as_sf(coords = c(x = "lon", y = "lat"), crs = "WGS84") %>% st_transform(crs = atlantis_bgm$extra$projection) # turn to spatial object
ggplot()+
geom_sf(data = race_sf, aes(color = resids, alpha = .8))+
scale_color_viridis()+
geom_sf(data = coast_sf)+
theme_minimal()+
labs(title = paste(race_data$name,"model residuals in space - stage:", race_data$stage, sep = " "))+
facet_wrap(~year, ncol = 2)
Take a grid (which must contain information on the predictors we used to build the model) and predict the biomass index over such grid based on the predictors. The grid is currently a regular grid with 10-km cell size, but 10 km might not be enough to get prediction points in all boxes - especially for a couple very small and narrow boxes at the western end of the model domain. Revisit this if necessary, but a finer mesh could be difficult to justify compared to the density of the survey data. The grid covers the entire Atlantis model domain, including the non-dynamic boundary boxes (deeper than 1000 m).
Read in the Atlantis prediction grid (10 km) modified in Atlantis_grid_covars.R (code not included here).
atlantis_boxes <- atlantis_bgm %>% box_sf()
Important: depth in the RACE data is a positive number. Depth in the prediction grid we obtained from the ETOPO rasters is a negative number. When we use depth as predictor for in our regular grid, make sure depth is a positive number for consistency with the model variable, or else everything will be upside-down.
load("data/atlantis_grid_depth.Rdata")
atlantis_grid_depth <- atlantis_grid_depth %>% mutate(depth = -depth) # making sure that depth has the same sign/orientation in the data and prediction grid
# add coordinate columns
atlantis_coords <- atlantis_grid_depth %>% st_as_sf(coords = c("x", "y"), crs = atlantis_bgm$extra$projection) %>%
st_transform(crs = "+proj=longlat +datum=WGS84") %>% dplyr::select(geometry)
atlantis_grid <- cbind(atlantis_grid_depth, do.call(rbind, st_geometry(atlantis_coords)) %>%
as_tibble() %>% setNames(c("lon","lat")))
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
paste("Positive depths are:", length(which(atlantis_grid$depth>0)), "out of:", nrow(atlantis_grid_depth), sep = " ") # Write out a check that depths are positive (few negatives are OK - they are on land - I'll fix it but it should not matter as island boxes will be boundary boxes in Atlantis)
## [1] "Positive depths are: 6237 out of: 6259"
# add year column
all_years <- levels(factor(race_data$year))
atlantis_grid <- atlantis_grid[rep(1:nrow(atlantis_grid), length(all_years)),]
atlantis_grid$year <- as.integer(rep(all_years, each = nrow(atlantis_grid_depth)))
Make SDM predictions onto new data from depth model. Back-transforming here, is this sensible?
predictions_race <- predict(m_depth, newdata = atlantis_grid, return_tmb_object = TRUE)
atlantis_grid$estimates <- exp(predictions_race$data$est) #Back-transforming here, is this sensible?
atlantis_grid_sf <- atlantis_grid %>% st_as_sf(coords = c("x", "y"), crs = atlantis_bgm$extra$projection) # better for plots
Not plotting most of Canada because the predictions look terrible (due to not having biomass data from there in this model).
ggplot()+
geom_sf(data = subset(atlantis_boxes, box_id < 92), aes(fill = NULL))+
geom_sf(data = subset(atlantis_grid_sf, box_id < 92), aes(color=log1p(estimates)))+ # taking the log for visualisation
geom_sf(data = coast_sf, colour = "black", fill = "grey80")+
scale_color_viridis(name = expression(paste("Log(CPUE) num ", km^-2)))+
theme_minimal()+
labs(title = paste(race_data$name,"predicted CPUE - stage:", race_data$stage, sep = " "))+
facet_wrap(~year, ncol = 2)
Attribute the predictions to their respective Atlantis box, so that we can take box averages.
atlantis_grid_means <- atlantis_grid %>% group_by(year, box_id) %>%
summarise(mean_estimates = mean(estimates), na.rm = TRUE) %>% ungroup()
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
# join this with the box_sf file
predictions_by_box <- atlantis_boxes %>% inner_join(atlantis_grid_means, by = "box_id")
See estimates per box for all years. Silence boundary boxes as they throw the scale out of whack (and they do not need predictions).
predictions_by_box <- predictions_by_box %>% rowwise() %>% mutate(mean_estimates = ifelse(isTRUE(boundary), NA, mean_estimates))
ggplot()+
geom_sf(data = predictions_by_box[predictions_by_box$box_id<92,], aes(fill = log1p(mean_estimates)))+ # taking the log for visualisation
scale_fill_viridis(name = expression(paste("Log(CPUE) num ", km^-2)))+
theme_minimal()+
geom_sf(data = coast_sf, colour = "black", fill = "grey80")+
facet_wrap(~year, ncol = 2)+
labs(title = paste(race_data$name, "mean predicted CPUE by Atlantis box - stage:", race_data$stage, sep = " "))
Plot the raw data again for comparison.
ggplot()+
geom_point(data = race_data, aes(lon, lat, colour = log1p(num_km2)), size = 1.5, alpha = .5)+ # taking the log for visualisation
scale_colour_viridis_c(name = expression(paste("Log(CPUE) num ", km^-2)))+
geom_polygon(data = coast, aes(x = long, y = lat, group = group), colour = "black", fill = "grey80")+
theme_minimal()+
facet_wrap(~year, ncol = 2)+
labs(title = paste(race_data$name,"CPUE from GOA bottom trawl survey - stage:", race_data$stage, sep = " "))
## Regions defined for each Polygons
Check also depth distributions.
ggplot(data = race_data, aes(depth))+
geom_histogram(colour = "black", fill = 'grey80')+
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Interpret this one with care: does it mean that the species is found more frequently at a certain depth, or that the sampling happens most frequently at a certain depth?
Plot data and predictions distributions.
ggplot(data = race_data, aes(log1p(num_km2)))+
geom_histogram(colour = "black", fill = 'grey80')+
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = predictions_by_box, aes(log1p(mean_estimates)))+
geom_histogram(colour = "black", fill = 'grey80')+
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 224 rows containing non-finite values (stat_bin).
Now calculate means of the predictions for the entire study period. Doing it by taking 1984-2019 averages for each Atlantis box.
means_all_years <- predictions_by_box %>% group_by(box_id, area, boundary) %>% summarise(all_years_numkm2 = mean(mean_estimates))
## `summarise()` has grouped output by 'box_id', 'area'. You can override using the `.groups` argument.
ggplot()+
geom_sf(data = means_all_years[means_all_years$box_id < 92,], aes(fill = log1p(all_years_numkm2)))+ # log for visualisation
scale_fill_viridis(name = expression(paste("Log(CPUE) num ", km^-2)))+
geom_sf(data = coast_sf, colour = "black", fill = "grey80")+
theme_minimal()+
labs(title = paste(race_data$name, "mean predicted CPUE by Atlantis box (1984-2019) - stage:", race_data$stage, sep = " "))
Trying to evaluate model skill by having a look at how well model predictions align with observations.
Since this is a spatially-explicit approach, we need observations and predictions at the same location. One approach would be to look at things box-wise by looking at relationships between average CPUE from data per box and average predicted CPUE per box. While this makes sense (boxes are our unit of space in Atlantis after all), I did some testing and it does not come without issues. For example, some boxes have no data points in certain years, but have a constant number of predicted points from the regular grid every year. So, my approach at the moment is to use the locations of all RACE hauls as a prediction grid. Is this circular?
#make a prediction grid from the race data itself
race_grid_tmp <- race_data %>% dplyr::select(lon, lat, depth)
# add year
race_grid <- race_grid_tmp[rep(1:nrow(race_grid_tmp), length(all_years)),]
race_grid$year <- as.integer(rep(all_years, each = nrow(race_grid_tmp)))
# predict on this grid
predictions_at_locations <- predict(m_depth, newdata = race_grid, return_tmb_object = TRUE)
race_grid$predictions <- exp(predictions_at_locations$data$est) # back-transforming here
Now join by year and coordinates to have predictions at the sampling points.
race_corr <- race_data %>% left_join(race_grid, by = c("year", "lat", "lon"))
paste0("Pearson's coef observations vs predictions: ", cor(race_corr$num_km2, race_corr$predictions, use = "everything", method = "pearson"))
## [1] "Pearson's coef observations vs predictions: 0.424214302455569"
What is a good value here?
Plot.
ggplot(race_corr, aes(x = log1p(predictions), y = log1p(num_km2)))+ # log for visualisation
geom_point(aes(color = depth.y))+
scale_color_viridis()+
geom_abline(intercept = 0, slope = 1)+
theme_minimal()+
facet_wrap(~year, scales = "free")+
labs(title = paste(race_data$name, "observed vs predicted CPUE. Stage: ", race_data$stage, sep = " "))
Question for Gemma: when I use the model to predict back onto the original data, the model seems to get the zeroes wrong (i.e. it predicts there to be fish where the data said there wasn’t). At quick glance, this would appear to happen more (but not only) for deeper depths. Have you ever observed this, and if yes, do you have any tips on how I should be thinking about it?
Calculate RMSE between predicted and observed values.
paste("RMSE:", sqrt(mean(race_corr$predictions - race_corr$num_km2)^2), " num km-2", sep = " ") ### traditional rmse metric, in units num km2
## [1] "RMSE: 113.153678479765 num km-2"
Normalised RMSE.
rmse_cv <- sqrt(mean((race_corr$predictions - race_corr$num_km2)^2))/(max(race_corr$num_km2)-min(race_corr$num_km2))*100 #### normalised rmse, expressed as a % of the range of observed biomass values, sort of approximates a coefficient of variation
paste("Normalised RMSE:", paste0(rmse_cv, "%"), sep = " ")
## [1] "Normalised RMSE: 1.85240728703426%"
What is a good value here?
The current estimated CPUE is in num km\(^{-2}\). So, just I just turn that into fish per box (biomass pools groups will follow the same workflow except the model will predict biomass CPUE). Remember that the area is in m\(^2\) for the boxes, so need to divide by 1,000,000.
means_all_years <- means_all_years %>% mutate(numbers = all_years_numkm2*area*1e-06)
total_numbers <- sum(means_all_years$numbers, na.rm = TRUE) # get total numbers
means_all_years <- means_all_years %>% mutate(proportion = numbers/total_numbers) # get proportion from each box - needs to sum to 1
Of course this is missing Canada at the moment.
Plot.
ggplot()+
geom_sf(data = means_all_years[means_all_years$box_id < 92,], aes(fill = proportion))+
scale_fill_viridis(name = "Box numbers as proportion \n of total numbers")+
geom_sf(data = coast_sf, colour = "black", fill = "grey80")+
theme_minimal()+
labs(title = paste(race_data$name, "proportion of individuals (S1-S4) by box. Stage:", race_data$stage, sep = " "))
means_all_years %>% select(box_id, all_years_numkm2, numbers, proportion) %>% st_set_geometry(NULL) %>% kable(align = 'lccc', format = "markdown",
col.names = c("Box", "CPUE (num km-2)", "Number of individuals", "Proportion"))
| Box | CPUE (num km-2) | Number of individuals | Proportion |
|---|---|---|---|
| 0 | NA | NA | NA |
| 2 | NA | NA | NA |
| 3 | 56.809886 | 55675.94 | 0.0001174 |
| 4 | 405.887012 | 845260.56 | 0.0017825 |
| 5 | 263.407405 | 698995.84 | 0.0014741 |
| 6 | 80.477150 | 164947.99 | 0.0003478 |
| 7 | 501.114872 | 1667638.02 | 0.0035168 |
| 8 | NA | NA | NA |
| 9 | 776.165223 | 1955134.63 | 0.0041230 |
| 10 | 279.252381 | 103528.74 | 0.0002183 |
| 11 | NA | NA | NA |
| 12 | 300.483753 | 1504844.19 | 0.0031735 |
| 13 | 503.004589 | 4950178.62 | 0.0104391 |
| 14 | 50.362894 | 88477.68 | 0.0001866 |
| 15 | 459.773569 | 644191.55 | 0.0013585 |
| 16 | 134.324738 | 472384.72 | 0.0009962 |
| 17 | 928.334553 | 3771601.76 | 0.0079536 |
| 18 | 1515.431361 | 12266223.75 | 0.0258673 |
| 19 | 61.646343 | 83218.44 | 0.0001755 |
| 20 | 927.654439 | 1105837.69 | 0.0023320 |
| 22 | 226.552944 | 561934.61 | 0.0011850 |
| 23 | 802.772906 | 1423936.01 | 0.0030028 |
| 24 | 2066.143269 | 17359622.22 | 0.0366084 |
| 25 | 730.260826 | 6711218.81 | 0.0141528 |
| 26 | 324.272803 | 759143.46 | 0.0016009 |
| 27 | 1699.059809 | 9450817.46 | 0.0199301 |
| 28 | 770.585087 | 2320120.12 | 0.0048927 |
| 29 | 3109.354318 | 23060831.34 | 0.0486312 |
| 30 | NA | NA | NA |
| 31 | 71.779802 | 420445.16 | 0.0008866 |
| 32 | 2858.946647 | 3202006.47 | 0.0067525 |
| 33 | 1830.415304 | 7126082.53 | 0.0150277 |
| 34 | 4858.161882 | 21683031.99 | 0.0457257 |
| 35 | 363.044855 | 2502166.41 | 0.0052766 |
| 36 | 5624.779888 | 4469712.26 | 0.0094258 |
| 37 | 2969.517189 | 22006052.09 | 0.0464069 |
| 38 | 178.459061 | 661636.71 | 0.0013953 |
| 39 | 5419.000103 | 30933738.49 | 0.0652338 |
| 41 | 1623.451566 | 1057493.94 | 0.0022301 |
| 42 | 3582.802388 | 15039926.08 | 0.0317166 |
| 43 | 853.553635 | 1654299.55 | 0.0034886 |
| 44 | 311.150334 | 544262.39 | 0.0011478 |
| 45 | 1290.336379 | 9153930.96 | 0.0193040 |
| 46 | 66.202567 | 356282.43 | 0.0007513 |
| 47 | 2433.369123 | 13802736.00 | 0.0291075 |
| 48 | 3910.208111 | 66468272.24 | 0.1401699 |
| 49 | 742.479989 | 2604141.15 | 0.0054917 |
| 50 | 1424.159043 | 6245183.00 | 0.0131700 |
| 51 | 5125.260279 | 34004775.81 | 0.0717101 |
| 52 | 3934.211859 | 14989464.37 | 0.0316101 |
| 53 | NA | NA | NA |
| 54 | 1951.189953 | 1446957.52 | 0.0030514 |
| 55 | 1737.229406 | 19407644.57 | 0.0409273 |
| 56 | 1341.252080 | 6811815.08 | 0.0143649 |
| 57 | 126.441355 | 14633.38 | 0.0000309 |
| 58 | NA | NA | NA |
| 59 | 42.721686 | 93996.28 | 0.0001982 |
| 60 | 857.371277 | 794192.50 | 0.0016748 |
| 61 | 1209.364266 | 2151593.66 | 0.0045373 |
| 62 | NA | NA | NA |
| 64 | 740.572990 | 4024536.96 | 0.0084870 |
| 65 | 363.738578 | 384665.69 | 0.0008112 |
| 66 | 1258.234211 | 1339574.00 | 0.0028249 |
| 67 | 1435.495993 | 8609303.75 | 0.0181555 |
| 68 | 231.926189 | 189854.72 | 0.0004004 |
| 69 | NA | NA | NA |
| 70 | 297.426944 | 259384.91 | 0.0005470 |
| 71 | 1012.825452 | 4897396.03 | 0.0103277 |
| 72 | 36.782728 | 68395.67 | 0.0001442 |
| 73 | 475.466788 | 739768.99 | 0.0015600 |
| 74 | 931.596990 | 5025696.42 | 0.0105983 |
| 75 | 1102.809315 | 4701391.41 | 0.0099144 |
| 76 | 627.949370 | 1936106.99 | 0.0040829 |
| 77 | 35.371845 | 24181.89 | 0.0000510 |
| 78 | 973.251184 | 3303209.92 | 0.0069659 |
| 79 | 1249.890921 | 14167754.81 | 0.0298773 |
| 80 | 335.389628 | 313816.30 | 0.0006618 |
| 81 | NA | NA | NA |
| 82 | 1405.310584 | 2536020.96 | 0.0053480 |
| 83 | 1756.283454 | 4094352.87 | 0.0086343 |
| 84 | 350.871010 | 611204.23 | 0.0012889 |
| 85 | 49.420984 | 176067.51 | 0.0003713 |
| 87 | 1308.727381 | 1544417.65 | 0.0032569 |
| 88 | 574.527673 | 4113398.85 | 0.0086744 |
| 89 | NA | NA | NA |
| 90 | 935.238076 | 4117113.61 | 0.0086823 |
| 91 | 743.789670 | 1158120.07 | 0.0024423 |
| 92 | 212.264638 | 631343.26 | 0.0013314 |
| 93 | 7.992377 | 20488.83 | 0.0000432 |
| 94 | 388.328689 | 1168122.69 | 0.0024634 |
| 95 | NA | NA | NA |
| 96 | 179.641183 | 260401.52 | 0.0005491 |
| 97 | NA | NA | NA |
| 98 | 432.317823 | 1400922.40 | 0.0029543 |
| 100 | 43.291544 | 331572.36 | 0.0006992 |
| 101 | 70.692505 | 320821.32 | 0.0006766 |
| 102 | 384.045427 | 3041148.79 | 0.0064132 |
| 103 | 626.193338 | 3443703.76 | 0.0072622 |
| 104 | 720.032571 | 6246321.59 | 0.0131724 |
| 105 | 80.534252 | 196036.84 | 0.0004134 |
| 106 | 159.149057 | 478498.83 | 0.0010091 |
| 107 | 700.462028 | 6650536.45 | 0.0140248 |
| 108 | NA | NA | NA |